1. Frame

A Portuguese bank wants to run a direct marketing campaign to sell its new term deposit plan. The goal is to help the bank identify the customers who are most likely to buy the plan.

Open Discussion: How would you approach this problem?

2. Acquire

The UCI Machine Learning Repository hosts many datasets for machine learning. We will use its Bank Marketing dataset; see https://archive.ics.uci.edu/ml/datasets/Bank+Marketing for more information.

Load the train and test datasets


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12, 8)

In [3]:
# Load the train dataset
train = pd.read_csv("../Data/train.csv")

In [4]:
# Load the test dataset
test = pd.read_csv("../Data/test.csv")

In [5]:
# View the first 5 records of train
train.head()


Out[5]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no

In [6]:
# View the last 10 records of test
test.tail(10)


Out[6]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
9990 62 management divorced tertiary no 5943 no no telephone 17 feb 196 4 -1 0 unknown yes
9991 38 technician single tertiary no 25 yes no cellular 1 jun 232 2 -1 0 unknown yes
9992 25 management single tertiary no 316 no no cellular 27 mar 347 2 -1 0 unknown yes
9993 43 technician divorced unknown no 4389 no no cellular 8 apr 618 1 -1 0 unknown yes
9994 45 admin. divorced secondary no 0 no no cellular 29 oct 264 1 -1 0 unknown yes
9995 78 retired divorced primary no 1389 no no cellular 8 apr 335 1 -1 0 unknown yes
9996 30 management single tertiary no 398 no no cellular 27 oct 102 1 180 3 success yes
9997 69 retired divorced tertiary no 247 no no cellular 22 apr 138 2 -1 0 unknown yes
9998 48 entrepreneur married secondary no 0 no yes cellular 28 jul 431 2 -1 0 unknown yes
9999 31 admin. single secondary no 131 yes no cellular 15 jun 151 1 -1 0 unknown yes

In [7]:
# List the attributes/feature names/columns in train dataset
train.columns


Out[7]:
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'deposit'],
      dtype='object')

In [8]:
# List the attributes in test dataset. 
test.columns


Out[8]:
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'deposit'],
      dtype='object')

In [9]:
type(test.columns)


Out[9]:
pandas.indexes.base.Index

In [10]:
train.columns.values


Out[10]:
array(['age', 'job', 'marital', 'education', 'default', 'balance',
       'housing', 'loan', 'contact', 'day', 'month', 'duration',
       'campaign', 'pdays', 'previous', 'poutcome', 'deposit'], dtype=object)

In [11]:
test.columns.values


Out[11]:
array(['age', 'job', 'marital', 'education', 'default', 'balance',
       'housing', 'loan', 'contact', 'day', 'month', 'duration',
       'campaign', 'pdays', 'previous', 'poutcome', 'deposit'], dtype=object)

In [12]:
# Do the test columns match those of train?
[x in test.columns.values for x in train.columns.values]


Out[12]:
[True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True]

Attribute Information:

Input variables:

bank client data:

  1. age (numeric)
  2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
  3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
  4. education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
  5. default: has credit in default? (categorical: 'no','yes','unknown')
  6. housing: has housing loan? (categorical: 'no','yes','unknown')
  7. loan: has personal loan? (categorical: 'no','yes','unknown')
related with the last contact of the current campaign:

  8. contact: contact communication type (categorical: 'cellular','telephone')
  9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
  10. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
  11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

other attributes:

  12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  14. previous: number of contacts performed before this campaign and for this client (numeric)
  15. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

social and economic context attributes:

  16. emp.var.rate: employment variation rate - quarterly indicator (numeric)
  17. cons.price.idx: consumer price index - monthly indicator (numeric)
  18. cons.conf.idx: consumer confidence index - monthly indicator (numeric)
  19. euribor3m: euribor 3 month rate - daily indicator (numeric)
  20. nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

  21. y: has the client subscribed to a term deposit? (binary: 'yes','no')

Note: this description comes from the newer "bank-additional" release of the dataset. The train/test files used here follow the older schema visible above: education takes 'primary'/'secondary'/'tertiary'/'unknown', the last contact day is a numeric day of month ('day'), pdays uses -1 (not 999) for "not previously contacted", poutcome also takes the values 'other' and 'unknown', the social and economic context attributes are absent, and the target column is named 'deposit'.

3. Explore


In [13]:
train.dtypes


Out[13]:
age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
deposit      object
dtype: object

In [14]:
# Find unique values in deposit for train dataset
pd.unique(train.deposit)


Out[14]:
array(['no', 'yes'], dtype=object)

In [15]:
# Find unique values in deposit for test dataset. Are they the same? 
pd.unique(test.deposit)


Out[15]:
array(['no', 'yes'], dtype=object)

In [16]:
pd.unique(test['month'])


Out[16]:
array(['may', 'apr', 'jun', 'jul', 'aug', 'feb', 'nov', 'jan', 'mar',
       'sep', 'oct', 'dec'], dtype=object)

In [17]:
# Find frequency of deposit in train dataset
train.deposit.value_counts()


Out[17]:
no     31092
yes     4119
Name: deposit, dtype: int64

In [18]:
# Find frequency of deposit in test dataset
test.deposit.value_counts()


Out[18]:
no     8830
yes    1170
Name: deposit, dtype: int64

In [19]:
type(train.deposit.value_counts())


Out[19]:
pandas.core.series.Series

In [20]:
# Is the distribution of deposit similar in train and test?
print("train:",train.deposit.value_counts()[1]/train.shape[0]*100)
print("test:",test.deposit.value_counts()[1]/test.shape[0]*100)


train: 11.6980489052
test: 11.7

In [21]:
# Find number of rows and columns in train 
train.shape


Out[21]:
(35211, 17)

In [22]:
# Find number of rows and columns in test
test.shape


Out[22]:
(10000, 17)

Find basic summary metrics for the train dataframe


In [23]:
train.describe()


Out[23]:
age balance day duration campaign pdays previous
count 35211.000000 35211.000000 35211.000000 35211.000000 35211.000000 35211.000000 35211.000000
mean 40.965153 1355.947914 15.802221 258.191048 2.759337 40.104087 0.582659
std 10.651197 3060.839946 8.339288 257.335241 3.098252 100.220917 2.418828
min 18.000000 -8019.000000 1.000000 0.000000 1.000000 -1.000000 0.000000
25% 33.000000 71.000000 8.000000 103.000000 1.000000 -1.000000 0.000000
50% 39.000000 447.000000 16.000000 180.000000 2.000000 -1.000000 0.000000
75% 48.000000 1418.000000 21.000000 319.000000 3.000000 -1.000000 0.000000
max 95.000000 102127.000000 31.000000 4918.000000 63.000000 871.000000 275.000000

Where did the remaining columns go?

describe() summarizes only the numeric columns by default; categorical (object) columns are excluded.
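
To get the analogous summary for the categorical columns (count, number of unique values, most frequent value and its frequency), describe can be pointed at object dtypes - a quick sketch, output not shown:

# Summarize the categorical (object) columns
train.describe(include=['object'])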

Plots


In [24]:
# Create labels: 0 for 'no' and 1 for 'yes' in the train dataset
labels = np.where(train.deposit=="no", 0, 1)

In [25]:
# Display the counts of 0 and 1 - do they match the value_counts above?
np.unique(labels, return_counts=True)


Out[25]:
(array([0, 1]), array([31092,  4119]))

Bivariate plot: Deposit vs age


In [26]:
train.head()


Out[26]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no

In [27]:
train.loc[:,['deposit','age']]


Out[27]:
deposit age
0 no 58
1 no 44
2 no 33
3 no 47
4 no 33
5 no 28
6 no 42
7 no 58
8 no 43
9 no 41
10 no 29
11 no 53
12 no 57
13 no 45
14 no 57
15 no 60
16 no 33
17 no 28
18 no 32
19 no 25
20 no 44
21 no 39
22 no 52
23 no 36
24 no 57
25 no 49
26 no 60
27 no 59
28 no 51
29 no 57
... ... ...
35181 no 36
35182 yes 62
35183 yes 38
35184 yes 36
35185 yes 34
35186 no 66
35187 no 46
35188 no 63
35189 yes 60
35190 no 59
35191 yes 32
35192 yes 29
35193 no 25
35194 yes 32
35195 yes 75
35196 yes 29
35197 yes 68
35198 yes 25
35199 yes 36
35200 no 34
35201 yes 38
35202 yes 53
35203 yes 34
35204 yes 23
35205 yes 73
35206 yes 25
35207 yes 51
35208 yes 72
35209 no 57
35210 no 37

35211 rows × 2 columns


In [28]:
bivariate_plot_deposit_age = train.loc[:,["deposit", "age"]].copy()

In [29]:
bivariate_plot_deposit_age.head()


Out[29]:
deposit age
0 no 58
1 no 44
2 no 33
3 no 47
4 no 33

In [30]:
bivariate_plot_deposit_age.age.hist()


Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x1100f49b0>

In [31]:
sns.stripplot(x="deposit", y = "age", data = bivariate_plot_deposit_age,
             jitter = True, alpha = 0.1)


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x1092d6e48>

Multivariate plot: Deposit vs age and pdays


In [32]:
train.plot(kind="scatter", x = 'age', y = 'pdays', color = labels, alpha = 0.5, s=50)


Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1103ff048>

Multivariate plot: Deposit vs day and duration


In [33]:
train.plot(kind="scatter", x = 'day', y = 'duration', color = labels, 
           alpha = 0.5, s=50)


Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x1109624a8>

4. Refine

Convert categorical variables to numeric.

Two options:

  1. Label Encoder
  2. One-Hot Encoding
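
As a quick illustration of the difference, here is a minimal sketch on a made-up toy column:

s = pd.Series(['red', 'green', 'blue', 'green'])

# Label encoding: each category becomes a single integer code
s.astype('category').cat.codes   # -> 2, 1, 0, 1

# One-hot encoding: one 0/1 indicator column per category
pd.get_dummies(s)                # -> columns: blue, green, red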

Label Encoding


In [34]:
import sklearn
from sklearn import preprocessing

In [35]:
# Find the columns that are categorical
train.select_dtypes(include=['object'])


Out[35]:
job marital education default housing loan contact month poutcome deposit
0 management married tertiary no yes no unknown may unknown no
1 technician single secondary no yes no unknown may unknown no
2 entrepreneur married secondary no yes yes unknown may unknown no
3 blue-collar married unknown no yes no unknown may unknown no
4 unknown single unknown no no no unknown may unknown no
5 management single tertiary no yes yes unknown may unknown no
6 entrepreneur divorced tertiary yes yes no unknown may unknown no
7 retired married primary no yes no unknown may unknown no
8 technician single secondary no yes no unknown may unknown no
9 admin. divorced secondary no yes no unknown may unknown no
10 admin. single secondary no yes no unknown may unknown no
11 technician married secondary no yes no unknown may unknown no
12 services married secondary no yes no unknown may unknown no
13 admin. single unknown no yes no unknown may unknown no
14 blue-collar married primary no yes no unknown may unknown no
15 retired married primary no yes no unknown may unknown no
16 services married secondary no yes no unknown may unknown no
17 blue-collar married secondary no yes yes unknown may unknown no
18 blue-collar single primary no yes yes unknown may unknown no
19 services married secondary no yes no unknown may unknown no
20 admin. married secondary no yes no unknown may unknown no
21 management single tertiary no yes no unknown may unknown no
22 entrepreneur married secondary no yes yes unknown may unknown no
23 technician single secondary no yes yes unknown may unknown no
24 technician married secondary no no yes unknown may unknown no
25 management married tertiary no yes no unknown may unknown no
26 admin. married secondary no yes yes unknown may unknown no
27 blue-collar married secondary no yes no unknown may unknown no
28 management married tertiary no yes no unknown may unknown no
29 technician divorced secondary no yes no unknown may unknown no
... ... ... ... ... ... ... ... ... ... ...
35181 admin. single tertiary no no no cellular nov failure no
35182 blue-collar married secondary no no no cellular nov success yes
35183 entrepreneur single secondary no no no cellular nov success yes
35184 admin. divorced secondary no yes no cellular nov success yes
35185 blue-collar married secondary no yes no cellular nov success yes
35186 retired married secondary no no no cellular nov failure no
35187 blue-collar married secondary no no no cellular nov failure no
35188 retired married secondary no no no cellular nov success no
35189 services married tertiary no yes no cellular nov success yes
35190 unknown married unknown no no no cellular nov failure no
35191 services single secondary no yes no cellular nov unknown yes
35192 management single secondary no yes no cellular nov success yes
35193 services single secondary no no no cellular nov failure no
35194 blue-collar married secondary no no no cellular nov success yes
35195 retired divorced tertiary no yes no cellular nov failure yes
35196 management single tertiary no no no cellular nov unknown yes
35197 retired married secondary no no no cellular nov success yes
35198 student single secondary no no no cellular nov unknown yes
35199 management single secondary no yes no cellular nov unknown yes
35200 blue-collar single secondary no yes no cellular nov other no
35201 technician married secondary no yes no cellular nov unknown yes
35202 management married tertiary no no no cellular nov success yes
35203 admin. single secondary no no no cellular nov unknown yes
35204 student single tertiary no no no cellular nov unknown yes
35205 retired married secondary no no no cellular nov failure yes
35206 technician single secondary no no yes cellular nov unknown yes
35207 technician married tertiary no no no cellular nov unknown yes
35208 retired married secondary no no no cellular nov success yes
35209 blue-collar married secondary no no no telephone nov unknown no
35210 entrepreneur married secondary no no no cellular nov other no

35211 rows × 10 columns


In [36]:
train_to_convert = train.select_dtypes(include=["object_"]).copy()
test_to_convert = test.select_dtypes(include=["object_"]).copy()

In [37]:
train_np = np.array(train_to_convert)
test_np = np.array(test_to_convert)

In [38]:
# Fit one LabelEncoder per categorical column on the train values,
# then apply the same integer mapping to both train and test
for i in range(train_np.shape[1]):
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(train_np[:, i]))
    train_np[:, i] = lbl.transform(train_np[:, i])
    test_np[:, i] = lbl.transform(test_np[:, i])
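
One caveat: the encoders above see only the train values, so lbl.transform(test_np[:, i]) would raise a ValueError if test contained a category absent from train (it happens not to here). A common defensive variant - a sketch, not what was run in this notebook - fits each encoder on the union of train and test values:

for i in range(train_np.shape[1]):
    lbl = preprocessing.LabelEncoder()
    # Fit on every value seen in either dataset
    lbl.fit(list(train_np[:, i]) + list(test_np[:, i]))
    train_np[:, i] = lbl.transform(train_np[:, i])
    test_np[:, i] = lbl.transform(test_np[:, i])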

In [39]:
# Display train_np
train_np


Out[39]:
array([[4, 1, 2, ..., 8, 3, 0],
       [9, 2, 1, ..., 8, 3, 0],
       [2, 1, 1, ..., 8, 3, 0],
       ..., 
       [5, 1, 1, ..., 9, 2, 1],
       [1, 1, 1, ..., 9, 3, 0],
       [2, 1, 1, ..., 9, 1, 0]], dtype=object)

In [40]:
# How would you transform test? (It was transformed with the train-fitted encoders in the loop above.)
test_np


Out[40]:
array([[6, 2, 1, ..., 8, 3, 0],
       [1, 1, 0, ..., 0, 3, 0],
       [5, 1, 1, ..., 6, 3, 0],
       ..., 
       [5, 0, 2, ..., 0, 3, 1],
       [2, 1, 1, ..., 5, 3, 1],
       [0, 2, 1, ..., 6, 3, 1]], dtype=object)


In [43]:
# Now, merge the numeric and encoded train variables into one single dataset

In [44]:
train_numeric = np.array(train.select_dtypes(exclude=["object_"]).copy())

In [45]:
train_numeric.shape


Out[45]:
(35211, 7)

In [46]:
train_encoded = np.concatenate([train_numeric, train_np], axis=1)

In [47]:
# Now, merge the numeric and encoded test variables into one single dataset
test_numeric = np.array(test.select_dtypes(exclude=["object_"]).copy())

In [48]:
test_encoded = np.concatenate([test_numeric, test_np], axis=1)

5. Model


In [49]:
# Create train X and train Y

In [50]:
xlen = train_encoded.shape[1]-1

In [51]:
train_encoded_X = train_encoded[:, :xlen]

In [52]:
train_encoded_Y = np.array(train_encoded[:, -1], dtype=float)

In [53]:
train_encoded_Y


Out[53]:
array([ 0.,  0.,  0., ...,  1.,  0.,  0.])

In [54]:
# Create test X
test_encoded_X = test_encoded[:, :xlen]

In [55]:
# Create test Y
test_encoded_Y = np.array(test_encoded[:, -1], dtype=float)

Benchmark Model

With 0/1 labels, the mean squared error of 0/1 predictions is just the misclassification rate, so an all-zero ("always no") model errs on exactly the 11.7% of positive records.


In [56]:
# Benchmark: predict 0 ("no") for every test record
model_allzero = np.zeros_like(test_encoded_Y)

In [58]:
# The mean square error on AllZero model
print("Mean Squared Error on all zero model: %.2f"
      % (np.mean((model_allzero - test_encoded_Y) ** 2)*100))


Mean Squared Error on all zero model: 11.70
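
scikit-learn also ships a ready-made baseline estimator for exactly this purpose; a sketch using DummyClassifier (not run in this notebook):

from sklearn.dummy import DummyClassifier

# Always predicts the majority class ('no', i.e. 0)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(train_encoded_X, train_encoded_Y)
print(baseline.score(test_encoded_X, test_encoded_Y))  # accuracy of the always-no baseline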

First Model: Linear Regression

Y = β0 + β1·X1 + β2·X2 + … + βn·Xn


In [59]:
from sklearn import linear_model

In [60]:
model_linear = linear_model.LinearRegression()

In [61]:
model_linear.fit(train_encoded_X, train_encoded_Y)


Out[61]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [62]:
# The coefficients
print('Coefficients: \n', model_linear.coef_)


Coefficients: 
 [  1.04646483e-03   2.14190275e-06  -4.43191769e-04   4.82999672e-04
  -3.00268937e-03   4.58783750e-04   7.49324089e-03   1.01976457e-03
   2.07698879e-02   1.55435599e-02  -1.97712387e-02  -8.52556895e-02
  -4.47500171e-02  -3.76204027e-02   4.77309366e-03   2.80968562e-02]

In [63]:
# Prediction
model_linear_prediction = model_linear.predict(test_encoded_X)

In [64]:
model_linear_prediction


Out[64]:
array([ 0.05588253,  0.144221  ,  0.04835766, ...,  0.11550334,
        0.21314076,  0.05215351])

In [65]:
# Threshold the continuous predictions at 0.5 to get 0/1 class labels
# (note: the MSEs printed below are computed on the raw continuous predictions)
model_linear_prediction = np.where(model_linear_prediction > 0.5, 1, 0)

In [66]:
# The mean square error on train
print("Mean Squared Error on train: %.2f"
      % (np.mean((model_linear.predict(train_encoded_X) - train_encoded_Y) ** 2)*100))


Mean Squared Error on train: 8.14

In [67]:
# The mean square error on test
print("Mean Squared Error on test: %.2f"
      % (np.mean((model_linear.predict(test_encoded_X) - test_encoded_Y) ** 2)*100))


Mean Squared Error on test: 8.17

Second Model: L2 Logistic Regression


In [68]:
model_logistic_L2 = linear_model.LogisticRegression()

In [69]:
model_logistic_L2.fit(train_encoded_X, train_encoded_Y)


Out[69]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [70]:
# The coefficients
print('Coefficients: \n', model_logistic_L2.coef_)


Coefficients: 
 [[  6.82530131e-03   1.79331639e-05  -7.18550598e-03   3.87348869e-03
   -1.42900945e-01   3.30391815e-03   8.55022031e-02   7.36735209e-03
    1.79800068e-01   1.42854940e-01  -3.02376976e-01  -9.78798877e-01
   -7.33822319e-01  -6.24494186e-01   2.91691916e-02   1.82317196e-01]]

In [71]:
# Prediction
model_logistic_L2_prediction = model_logistic_L2.predict(test_encoded_X)

In [72]:
np.unique(model_logistic_L2_prediction)


Out[72]:
array([ 0.,  1.])

In [73]:
np.sum(model_logistic_L2_prediction)


Out[73]:
401.0

In [74]:
# The mean square error on train
print("Mean Squared Error on train: %.2f"
      % (np.mean((model_logistic_L2.predict(train_encoded_X) - train_encoded_Y) ** 2)*100))


Mean Squared Error on train: 10.86

In [75]:
# The mean square error on test
print("Mean Squared Error on test: %.2f"
      % (np.mean((model_logistic_L2_prediction - test_encoded_Y) ** 2)*100))


Mean Squared Error on test: 10.97

Third Model: L1 Logistic Regression


In [76]:
# Code here. Report the evaluation.
model_logistic_L1 = linear_model.LogisticRegression(penalty = 'l1')

In [77]:
model_logistic_L1.fit(train_encoded_X, train_encoded_Y)


Out[77]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [78]:
# The coefficients
print('Coefficients: \n', model_logistic_L1.coef_)


Coefficients: 
 [[  1.05637731e-02   1.68072330e-05  -5.62844791e-03   3.93893768e-03
   -1.38350472e-01   3.65055334e-03   8.73469701e-02   8.46801335e-03
    2.39397964e-01   1.83525906e-01  -3.16044934e-01  -1.04517588e+00
   -7.22688905e-01  -6.21400423e-01   3.55617478e-02   2.14720323e-01]]

In [79]:
# Prediction
model_logistic_L1_prediction = model_logistic_L1.predict(test_encoded_X)

In [80]:
np.unique(model_logistic_L1_prediction)


Out[80]:
array([ 0.,  1.])

In [81]:
np.sum(model_logistic_L1_prediction)


Out[81]:
419.0

In [82]:
# The mean square error on train
print("Mean Squared Error on train: %.2f"
      % (np.mean((model_logistic_L1.predict(train_encoded_X) - train_encoded_Y) ** 2)*100))


Mean Squared Error on train: 10.81

In [83]:
# The mean square error on test
print("Mean Squared Error on test: %.2f"
      % (np.mean((model_logistic_L1_prediction - test_encoded_Y) ** 2)*100))


Mean Squared Error on test: 11.07

Fourth Model: L2 Logistic Regression - Change the value of C


In [84]:
model_logistic_L2C = linear_model.LogisticRegression(C = 2)

In [85]:
model_logistic_L2C.fit(train_encoded_X, train_encoded_Y)


Out[85]:
LogisticRegression(C=2, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [86]:
# The coefficients
print('Coefficients: \n', model_logistic_L2C.coef_)


Coefficients: 
 [[  8.39493786e-03   1.73990030e-05  -6.34146842e-03   3.89666637e-03
   -1.41255272e-01   3.44310785e-03   8.63377831e-02   8.26141345e-03
    1.85344786e-01   1.81572594e-01  -3.18528700e-01  -9.88455139e-01
   -7.29451853e-01  -6.10667371e-01   3.12251656e-02   1.94457931e-01]]

In [87]:
# Prediction
model_logistic_L2C_prediction = model_logistic_L2C.predict(test_encoded_X)

In [88]:
np.unique(model_logistic_L2C_prediction)


Out[88]:
array([ 0.,  1.])

In [89]:
np.sum(model_logistic_L2C_prediction)


Out[89]:
401.0

In [90]:
# The mean square error on train
print("Mean Squared Error on train: %.2f"
      % (np.mean((model_logistic_L2C.predict(train_encoded_X) - train_encoded_Y) ** 2)*100))


Mean Squared Error on train: 10.84

In [91]:
# The mean square error on test
print("Mean Squared Error on test: %.2f"
      % (np.mean((model_logistic_L2C_prediction - test_encoded_Y) ** 2)*100))


Mean Squared Error on test: 10.99

Fifth Model: Decision Tree Model


In [107]:
from sklearn import tree
from sklearn.externals.six import StringIO
# import pydot

In [108]:
model_DT = tree.DecisionTreeClassifier()

In [109]:
# Let's use only two of the features (columns 1 and 2) for the model

In [110]:
model_DT.fit(train_encoded_X[:,1:3], train_encoded_Y)


Out[110]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [111]:
# dot_data = StringIO() 
# tree.export_graphviz(model_DT, out_file=dot_data) 
# graph = pydot.graph_from_dot_data(dot_data.getvalue()) 
# graph.write_pdf("dt1.pdf")

In [98]:
# Prediction
model_DT_prediction = model_DT.predict(test_encoded_X[:,1:3])

In [99]:
# The mean square error on train
print("Mean Squared Error on train: %.2f"
      % (np.mean((model_DT.predict(train_encoded_X[:,1:3]) - train_encoded_Y) ** 2)*100))


Mean Squared Error on train: 3.65

In [100]:
# The mean square error on test
print("Mean Squared Error on test: %.2f"
      % (np.mean((model_DT_prediction - test_encoded_Y) ** 2)*100))


Mean Squared Error on test: 17.75

Decision trees are prone to overfitting!
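
One standard mitigation is to cap the tree's depth so it cannot memorize the training data; a sketch (the depth value of 5 is purely illustrative):

# A shallower tree trades training fit for better generalization
model_DT_shallow = tree.DecisionTreeClassifier(max_depth=5)
model_DT_shallow.fit(train_encoded_X[:, 1:3], train_encoded_Y)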

Now, use all the features to build the model, and report the accuracy.


In [113]:
model_DTAll = tree.DecisionTreeClassifier()

In [117]:
model_DTAll.fit(train_encoded_X, train_encoded_Y)


Out[117]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [118]:
# Prediction
model_DTAll_prediction = model_DTAll.predict(test_encoded_X)

In [119]:
# The mean square error on train
print("Mean Squared Error on train: %.2f"
      % (np.mean((model_DTAll.predict(train_encoded_X) - train_encoded_Y) ** 2)*100))


Mean Squared Error on train: 0.00

In [120]:
# The mean square error on test
print("Mean Squared Error on test: %.2f"
      % (np.mean((model_DTAll_prediction - test_encoded_Y) ** 2)*100))


Mean Squared Error on test: 12.45

Sixth Model: Random Forest Model


In [101]:
from sklearn.ensemble import RandomForestClassifier

In [ ]:
?RandomForestClassifier

In [102]:
model_RF = RandomForestClassifier()

In [103]:
model_RF.fit(train_encoded_X, train_encoded_Y)


Out[103]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [104]:
# Prediction
model_RF_prediction = model_RF.predict(test_encoded_X)

In [105]:
# The mean percentage error (MSE × 100 on 0/1 labels) on train
print("Mean Percentage Error on train: %.2f"
      % (np.mean((model_RF.predict(train_encoded_X) - train_encoded_Y) ** 2)*100))


Mean Percentage Error on train: 0.77

In [106]:
# The mean percentage error (MSE × 100 on 0/1 labels) on test
print("Mean Percentage Error on test: %.2f"
      % (np.mean((model_RF_prediction - test_encoded_Y) ** 2)*100))


Mean Percentage Error on test: 10.14

Let's change the model parameters:

  • Use 400 trees
  • Use a maximum depth of 8
  • Print the Out-of-Bag (OOB) score

In [121]:
?RandomForestClassifier

In [122]:
model_RFMod = RandomForestClassifier(max_depth = 8, oob_score = True, n_estimators = 400 )

In [123]:
model_RFMod.fit(train_encoded_X, train_encoded_Y)


Out[123]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=8, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=400, n_jobs=1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)
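
Since oob_score=True was set, the out-of-bag accuracy estimate can be read directly off the fitted model (value not shown here):

# Accuracy estimated on the samples each tree never saw during bagging
print("OOB score:", model_RFMod.oob_score_)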

In [124]:
# Prediction
model_RFMod_prediction = model_RFMod.predict(test_encoded_X)

In [125]:
# The mean percentage error (MSE × 100 on 0/1 labels) on train
print("Mean Percentage Error on train: %.2f"
      % (np.mean((model_RFMod.predict(train_encoded_X) - train_encoded_Y) ** 2)*100))


Mean Percentage Error on train: 8.29

In [126]:
# The mean percentage error (MSE × 100 on 0/1 labels) on test
print("Mean Percentage Error on test: %.2f"
      % (np.mean((model_RFMod_prediction - test_encoded_Y) ** 2)*100))


Mean Percentage Error on test: 9.92

Cross Validation


In [127]:
from sklearn.cross_validation import StratifiedKFold

In [ ]:
?StratifiedKFold

In [128]:
skf = StratifiedKFold(train_encoded_Y, 5, random_state=1131, shuffle=True)

In [129]:
# Note: these are row-index arrays; avoid naming them `train`/`test`,
# which would shadow the train/test DataFrames loaded earlier
for train_idx, test_idx in skf:
    print("%s %s" % (train_idx, test_idx))
    print(train_idx.shape, test_idx.shape)


[    0     1     2 ..., 35207 35208 35210] [   10    17    23 ..., 35197 35201 35209]
(28168,) (7043,)
[    0     1     2 ..., 35206 35209 35210] [    4     6    11 ..., 35190 35207 35208]
(28168,) (7043,)
[    0     1     3 ..., 35207 35208 35209] [    2     8    16 ..., 35192 35199 35210]
(28169,) (7042,)
[    2     3     4 ..., 35208 35209 35210] [    0     1     7 ..., 35203 35204 35206]
(28169,) (7042,)
[    0     1     2 ..., 35208 35209 35210] [    3     5     9 ..., 35198 35202 35205]
(28170,) (7041,)

In [130]:
model_RF = RandomForestClassifier()

In [131]:
for k, (train_idx, test_idx) in enumerate(skf):
    model_RF.fit(train_encoded_X[train_idx], train_encoded_Y[train_idx])
    print("fold:", k+1, model_RF.score(train_encoded_X[test_idx], train_encoded_Y[test_idx]))


fold: 1 0.897486866392
fold: 2 0.901320460031
fold: 3 0.903578528827
fold: 4 0.898324339676
fold: 5 0.898593949723
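
The same per-fold scores can be computed in one call with cross_val_score; a sketch (in this scikit-learn version it lives in sklearn.cross_validation, later moved to sklearn.model_selection):

from sklearn.cross_validation import cross_val_score

# Stratified 5-fold accuracy for a fresh random forest
scores = cross_val_score(RandomForestClassifier(), train_encoded_X, train_encoded_Y, cv=5)
print(scores.mean())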

Find mean CV error


In [158]:
cv_error = []
for k, (train_idx, test_idx) in enumerate(skf):
    model_RF.fit(train_encoded_X[train_idx], train_encoded_Y[train_idx])
    print(k)
    # Score on the held-out fold, not on the full training set
    cv_error.append(np.mean((model_RF.predict(train_encoded_X[test_idx])
                             - train_encoded_Y[test_idx]) ** 2) * 100)


0
1
2
3
4

In [159]:
cv_error


Out[159]:
[five per-fold error percentages, roughly 10 each given the fold accuracies above; exact values vary from run to run]

In [163]:
# Mean CV error across the five folds
np.mean(cv_error)


Out[163]:
[mean of the five per-fold errors]

Exercise: repeat this with different parameters, different models, and different numbers of folds, and compare the mean CV errors.


In [ ]:


One Hot Encoding


In [132]:
# Reload the raw train/test DataFrames for the one-hot approach
train = pd.read_csv("../Data/train.csv")
test = pd.read_csv("../Data/test.csv")

In [133]:
train_one_hot = pd.get_dummies(train)

In [134]:
train_one_hot.head()


Out[134]:
age balance day duration campaign pdays previous job_admin. job_blue-collar job_entrepreneur ... month_may month_nov month_oct month_sep poutcome_failure poutcome_other poutcome_success poutcome_unknown deposit_no deposit_yes
0 58 2143 5 261 1 -1 0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
1 44 29 5 151 1 -1 0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
2 33 2 5 76 1 -1 0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
3 47 1506 5 92 1 -1 0 0.0 1.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
4 33 1 5 198 1 -1 0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0

5 rows × 53 columns


In [135]:
test_one_hot = pd.get_dummies(test)
test_one_hot.head()


Out[135]:
age balance day duration campaign pdays previous job_admin. job_blue-collar job_entrepreneur ... month_may month_nov month_oct month_sep poutcome_failure poutcome_other poutcome_success poutcome_unknown deposit_no deposit_yes
0 38 677 14 114 2 -1 0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
1 58 5445 14 391 1 -1 0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
2 55 5 20 108 1 -1 0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
3 26 63 28 76 4 -1 0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0
4 48 907 4 103 1 -1 0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0

5 rows × 53 columns


In [136]:
# Check if columns are the same

In [149]:
[x in test_one_hot.columns.values for x in train_one_hot.columns.values]


Out[149]:
[True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True]
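
Here every train column exists in test, so the check passes. If the dummy columns ever diverged (a category appearing in only one of the files), DataFrame.align could reconcile them - a sketch:

# Keep train's column set; any column missing from test is added as all zeros
train_one_hot, test_one_hot = train_one_hot.align(test_one_hot, join='left',
                                                  axis=1, fill_value=0)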

In [137]:
#Create train X , train Y, test X , test Y

In [138]:
# Features: every column except the two deposit_* indicator columns
train_X = train_one_hot.iloc[:, :train_one_hot.shape[1]-2]

In [139]:
train_X.columns


Out[139]:
Index(['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'education_primary', 'education_secondary', 'education_tertiary',
       'education_unknown', 'default_no', 'default_yes', 'housing_no',
       'housing_yes', 'loan_no', 'loan_yes', 'contact_cellular',
       'contact_telephone', 'contact_unknown', 'month_apr', 'month_aug',
       'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'poutcome_failure', 'poutcome_other', 'poutcome_success',
       'poutcome_unknown'],
      dtype='object')

In [140]:
# Target: the deposit_yes indicator (the last column)
train_Y = train_one_hot.iloc[:, -1]

In [141]:
train_Y.head()


Out[141]:
0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: deposit_yes, dtype: float64

In [142]:
test_X = test_one_hot.iloc[:, :test_one_hot.shape[1]-2]
test_Y = test_one_hot.iloc[:, -1]

In [143]:
# Run Random Forest and check accuracy

In [144]:
model_RF = RandomForestClassifier(n_estimators=400, max_depth=8, oob_score=True, n_jobs=-1)

In [145]:
model_RF.fit(train_X, train_Y)


Out[145]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=8, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=400, n_jobs=-1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)

In [146]:
# Prediction
model_RF_prediction = model_RF.predict(test_X)

In [147]:
# The mean percentage error (MSE × 100 on 0/1 labels) on train
print("Mean Percentage Error on train: %.2f"
      % (np.mean((model_RF.predict(train_X) - train_Y) ** 2)*100))


Mean Percentage Error on train: 9.87

In [148]:
# The mean percentage error (MSE × 100 on 0/1 labels) on test
print("Mean Percentage Error on test: %.2f"
      % (np.mean((model_RF_prediction - test_Y) ** 2)*100))


Mean Percentage Error on test: 10.54